Building a Corpus from Handwritten Picture Postcards: Transcription, Annotation and Part-of-Speech Tagging

Authors

  • Kyoko Sugisaki
  • Nicolas Wiedmer
  • Heiko Hausendorf
Abstract

In this paper, we present a corpus of over 11,000 holiday picture postcards written in German and Swiss German. We discuss the processes of digitalization, transcription and manual annotation, as well as the development of automatic text segmentation and part-of-speech tagging.

Posted at the Zurich Open Repository and Archive (ZORA), University of Zurich. URL: https://doi.org/10.5167/uzh-149510 (published version).

Originally published as: Sugisaki, Kyoko; Wiedmer, Nicolas; Hausendorf, Heiko (2018). Building a Corpus from Handwritten Picture Postcards: Transcription, Annotation and Part-of-Speech Tagging. In: 11th edition of the Language Resources and Evaluation Conference, Miyazaki, Japan, May 2018.

Affiliation: German Department, University of Zurich, Schönberggasse 9, 8001 Zurich, Switzerland. Contact: {sugisaki,nicolas.wiedmer,heiko.hausendorf}@ds.uzh.ch

Similar resources

The Specification of POS Tagging of the Hong Kong University Cantonese Corpus

The Hong Kong University Cantonese Corpus was collected from transcribed spontaneous speech in conversations and radio programs that involved two to four people. It was word-segmented, annotated with Cantonese pronunciation, and recently tagged with word classes by adopting the part-of-speech (POS) scheme of Yu et al. (2002). This scheme, which was designed for tagging written Mandarin texts, e...

From Detecting Errors to Automatically Correcting Them

Faced with the problem of annotation errors in part-of-speech (POS) annotated corpora, we develop a method for automatically correcting such errors. Building on top of a successful error detection method, we first try correcting a corpus using two off-the-shelf POS taggers, based on the idea that they enforce consistency; with this, we find some improvement. After some discussion of the tagging...

Part-of-Speech Tagging of Persian Using a Fuzzy Network Model

Part-of-speech tagging (POS tagging) is an active research topic in natural language processing (NLP). It is the process of classifying words into their parts of speech and labeling them accordingly, and is also called POS-tagging or simply tagging. Parts of speech are also known as word classes or lexical categories. The purpose of POS tagging is determining the grammatical ...

The goo300k corpus of historical Slovene

The paper presents a gold-standard reference corpus of historical Slovene containing 1,000 sampled pages from over 80 texts, which were, for the most part, written between 1750 and 1900. Each page of the transcription has an associated facsimile, and the words in the texts have been manually annotated with their modern-day equivalent, lemma and part of speech. The paper presents the structure of t...

A pilot study for a Corpus of Dutch Aphasic Speech (CoDAS)

In this paper, a pilot study for the development of a corpus of Dutch Aphasic Speech (CoDAS) is presented. Given the lack of resources of this kind not only for Dutch but also for other languages, CoDAS will be able to set standards and will contribute to the future research in this area. We have established the basic requirements with respect to text types, metadata, and annotation levels that...

Publication date: 2018